It looks like Raustats might not be what I am looking for to get SLA level census data quickly. Let’s try Census2016 from Hugh Parsonage.
rr full_2016_census <- Census2016_wide_by_SA2_year %>% filter(year == ‘2016’ ) head(full_2016_census)
Yes - this is what I need.
Loading other tables
rr ancestories_2016 <- Census2016_ancestories %>% filter(year == ‘2016’ ) countries_of_birth_2016 <- Census2016_countries_of_birth %>% filter(year == ‘2016’ ) languages_2016 <- Census2016_languages %>% filter(year == ‘2016’ ) religions_2016 <- Census2016_religions %>% filter(year == ‘2016’ )
Each of the four variables ancestory, country of birth, languages and religions are quite granular, and it may make sense to look at these variables at a lower level of granularity.
Ancestory:
rr download.file(‘https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&12490do0001_201912.xls&1249.0&Data%20Cubes&674EFC4CA0A3D8FDCA2584D30012B905&0&2019&18.12.2019&Latest’, ‘./Data/ancestry_classification.xls’, method = ‘libcurl’)
ancestry_classification_4dig <- readxl::read_xls(‘./Data/ancestry_classification.xls’, sheet = ‘Table 1.3’, skip = 7, col_names = c(‘X1’, ‘X2’, ‘Ancestory_Code_4’, ‘Ancestory’)) %>% filter(!is.na(Ancestory)) %>% select(Ancestory_Code_4, Ancestory)
ancestry_classification_1dig <- readxl::read_xls(‘./Data/ancestry_classification.xls’, sheet = ‘Table 1.1’, skip = 5, col_names = c(‘Ancestory_Code_1’, ‘Ancestory_Group’))%>% filter(!is.na(Ancestory_Group))
Country of birth
rr download.file(‘https://www.abs.gov.au/ausstats/subscriber.nsf/log?openagent&sacc_12690do0001_201903.xls&1269.0&Data%20Cubes&480BD730AF42D515CA2583BD007707C5&0&2016&15.03.2019&Latest’, ‘./Data/country_classification.xls’, method = ‘libcurl’)
country_classification_4dig <- readxl::read_xls(‘./Data/country_classification.xls’, sheet = ‘Table 1.3’, skip = 7, col_names = c(‘X1’, ‘X2’, ‘Country_Code_4’, ‘Country’)) %>% filter(!is.na(Country)) %>% select(-X1, -X2)
country_classification_2dig <- readxl::read_xls(‘./Data/country_classification.xls’, sheet = ‘Table 1.2’, skip = 6, col_names = c(‘X1’, ‘Country_Code_2’, ‘Country_Name_2’)) %>% filter(!is.na(Country_Name_2)) %>% select(-X1)
country_classification_1dig <- readxl::read_xls(‘./Data/country_classification.xls’, sheet = ‘Table 1.1’, skip = 5, col_names = c(‘Country_Code_1’, ‘Country_Group’)) %>% filter(!is.na(Country_Group))
Language
rr download.file(‘https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&ASCL_12670DO0001_201703.xls&1267.0&Data%20Cubes&F84620CF6E13F7E8CA257FF1001E68A7&0&2016&28.03.2017&Latest’, ‘./Data/language_classification.xls’, method = ‘libcurl’)
language_classification_4dig <- readxl::read_xls(‘./Data/language_classification.xls’, sheet = ‘Table 1.3’, skip = 8, col_names = c(‘X1’, ‘X2’, ‘X3’, ‘X4’, ‘Language_Code_3’, ‘Language’)) %>% filter(!is.na(Language)) %>% select(-X1, -X2, -X3, -X4)
language_classification_1dig <- readxl::read_xls(‘./Data/language_classification.xls’, sheet = ‘Table 1.1’, skip = 5, col_names = c(‘Language_Code_1’, ‘Language_Group’)) %>% filter(!is.na(Language_Group))
Religion
rr download.file(‘https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&ASCRG_12660DO0001_201707.xls&1266.0&Data%20Cubes&B3EAFE3FE6180D37CA257FF1001E673C&0&2016&14.07.2017&Latest’, ‘./Data/religion_classification.xls’, method = ‘libcurl’)
religion_classification_3dig <- readxl::read_xls(‘./Data/religion_classification.xls’, sheet = ‘Table 1.2’, skip = 6, col_names = c(‘X1’, ‘Religion_Code_3’, ‘Religion’)) %>% filter(!is.na(Religion)) %>% select(-X1)
religion_classification_1dig <- readxl::read_xls(‘./Data/religion_classification.xls’, sheet = ‘Table 1.1’, skip = 5, col_names = c(‘Religion_Code_1’, ‘Religion_Group’)) %>% filter(!is.na(Religion_Group))
Aggregate at the SA2 level - add unless variable contains median, average, persons_per_bedroom
census_2016_all_vars <- Census2016_wide_by_SA2_year %>%
filter(year == '2016') %>%
rowwise() %>%
mutate(sa2_id = paste0(substr(sa2_code, 1, 1), substr(sa2_code, 6, 9))) %>%
filter(isMissing == FALSE) %>%
mutate(percent_female = female/persons,
percent_defacto = defacto_persons/persons,
percent_married = married_persons/persons,
percent_indig = indig_persons/persons,
percent_born_in_australia = born_in_australia/persons,
percent_unit = flat_or_unit/n_dwellings,
percent_mortgage = dwelling_owned_mortgage/n_dwellings,
percent_rent = dwelling_rented/n_dwellings)
census_2016_means <- census_2016_all_vars %>%
select(median_age, median_household_income, average_household_size,
persons_per_bedroom, median_weekly_rent, median_annual_mortgage, sa2_id) %>%
group_by(sa2_id) %>%
summarise_all(mean, na.rm = TRUE)
census_2016_counts <- census_2016_all_vars %>%
select(n_dwellings, persons, female, male,
married_persons, married_females, married_males, defacto_persons,
defacto_females, defacto_males, notmarried_persons,
notmarried_females, notmarried_males, indig_persons,
indig_males, indig_females, non_indig_persons,
non_indig_females, non_indig_males, not_stated_indig_persons,
not_stated_indig_males, not_stated_indig_females,
born_in_australia, born_overseas, country_not_stated,
separate_house, flat_or_unit, housing_other_or_not_stated, semi_or_townhouse,
dwelling_owned_outright, dwelling_owned_mortgage, dwelling_other_or_not_stated,
dwelling_rented, sa2_id) %>%
group_by(sa2_id) %>%
summarise_all(sum, na.rm = TRUE) %>%
mutate(percent_female = female/persons,
percent_defacto = defacto_persons/persons,
percent_married = married_persons/persons,
percent_indig = indig_persons/persons,
percent_born_in_australia = born_in_australia/persons,
percent_unit = flat_or_unit/n_dwellings,
percent_mortgage = dwelling_owned_mortgage/n_dwellings,
percent_rent = dwelling_rented/n_dwellings)
So what I need is weighted demographic data for each of the polling places based on the number of people from each SLA2 who voted at the polling place. Since we don’t know who voted where, and who can vote at all, we are making the naive assumptions that * Voters at each SLA are similar * Voters are representitive of census respondents at the SLA2 level.
Download and load polling places by SA1
download.file('https://www.aec.gov.au/Elections/Federal_Elections/2016/files/polling-place-by-sa1s-2016.xlsx', './Data/polling-place-by-sa1s-2016.xlsx', method = 'libcurl')
trying URL 'https://www.aec.gov.au/Elections/Federal_Elections/2016/files/polling-place-by-sa1s-2016.xlsx'
Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 31087463 bytes (29.6 MB)
==================================================
downloaded 29.6 MB
polling_place_data <- readxl::read_xlsx('./Data/polling-place-by-sa1s-2016.xlsx')
Aggregate polling place data to SA2
rr polling_place_sa2 <- polling_place_data %>% mutate(sa2_id = floor(SA1_id / 100)) %>% group_by(year, state_ab, div_nm, pp_id, pp_nm, sa2_id) %>% summarise(votes = sum(votes))
Combine with demographic data and aggregate
polling_place_demog <- polling_place_sa2 %>%
mutate(sa2_id = as.character(sa2_id)) %>%
inner_join(census_2016_all_vars)
polling_place_demog_means <- polling_place_demog %>%
select(year, state_ab, div_nm, pp_id, pp_nm, sa2_id, votes,
median_age, median_household_income, average_household_size,
persons_per_bedroom, median_weekly_rent, median_annual_mortgage,
percent_female, percent_defacto, percent_married, percent_indig,
percent_born_in_australia, percent_unit, percent_mortgage, percent_rent) %>%
group_by(year, state_ab, div_nm, pp_id, pp_nm) %>%
summarise_at(vars(median_age, median_household_income,
average_household_size, persons_per_bedroom, median_weekly_rent,
median_annual_mortgage, percent_female, percent_defacto,
percent_married, percent_indig, percent_born_in_australia,
percent_unit, percent_mortgage,
percent_rent), funs(weighted.mean(., w=votes)))
Add in 2pp at the polling booth level
rr election_2pp <- twoparty_pollingbooth_download()
trying URL 'https://github.com/ropenscilabs/eechidna/raw/master/extra-data/tpp_pp.rda'
Content type 'application/octet-stream' length 1677810 bytes (1.6 MB)
==================================================
downloaded 1.6 MB
rr polling_place_2pp <- polling_place_demog_means %>% group_by() %>% rename(StateAb = state_ab, DivisionNm = div_nm, PollingPlace = pp_nm) %>% mutate(DivisionNm = toupper(DivisionNm), PollingPlace = toupper(PollingPlace)) %>% left_join(election_2pp %>% filter(year == 2016))
Joining, by = c(\year\, \StateAb\, \DivisionNm\, \PollingPlace\)
Check for missing data
polling_place_2pp %>%
summarise_all(funs(sum(is.na(.))))
Which booths are null?
polling_place_2pp %>%
filter(is.na(TotalVotes)) %>%
tabyl(PollingPlace)
PollingPlace n percent
ABSENT 150 0.25
POSTAL 150 0.25
PRE-POLL 150 0.25
PROVISIONAL 150 0.25
So the Absent, Postals, Pre-Poll and Provisional votes aren’t in this table. Let’s come back to those…
polling_place_2pp %>%
summarise_all(funs(sum(is.null(.))))
Remove the rows with NAs
polling_place_2pp_clean <- polling_place_2pp %>%
filter(!is.na(TotalVotes))
polling_place_2pp_clean %>%
summarise_all(funs(sum(is.na(.))))
Which polling stations have missing data? Not too concerned about post code, as there are some special booths
polling_place_2pp_clean %>%
filter(is.na(median_age)|is.na(Latitude))
Looks like mobile teams and prepoll centres, and only latitude and longitude. Will remove the Brand mobile team, as the demographic data does not look valid.
polling_place_2pp_clean<- polling_place_2pp_clean %>%
dplyr::filter(PollingPlaceID != 65161)
polling_place_2pp_clean %>%
ggplot(aes(x = LNP_Percent/100)) + stat_density(geom="line", colour = 'blue') +
theme_classic(base_size = 16) +
scale_x_continuous(labels=scales::percent) +
labs(title = '2016 Election: LNP 2 Party Preferred Percentage', x = '2PP Percentage',
y = 'Density', subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
ggplot(aes(x = ALP_Percent/100)) + stat_density(geom="line", colour = 'red') +
theme_classic(base_size = 16) +
scale_x_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', x = '2PP Percentage',
y = 'Density', subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
ggplot(aes(x = Swing/100)) + stat_density(geom="line", colour = 'purple') +
theme_classic(base_size = 16) +
scale_x_continuous(labels=scales::percent) +
labs(title = '2016 Election: Swing to Incumbent', x = 'Swing',
y = 'Density', subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
ggplot(aes(x = median_household_income)) + stat_density(geom="line", colour = 'purple') +
theme_classic(base_size = 16) +
scale_x_continuous(labels=scales::dollar) +
labs(title = '2016 Census: Median Income', x = 'Median Income',
y = 'Density', subtitle = 'by Polling Booth, Unweighted')
Can we look at these distributions by state?
polling_place_2pp_clean %>%
ggplot(aes(x = ALP_Percent/100, colour = StateAb)) +
stat_density(geom="line", position = 'dodge') +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
scale_x_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', x = '2PP Percentage',
y = 'Frequency', subtitle = 'by Polling Booth, Unweighted')
What about by median income
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = median_household_income, colour = StateAb)) +
geom_point() +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::dollar) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', x = 'Booth Median Income',
y = 'ALP 2pp Percentage', subtitle = 'by Polling Booth, Unweighted') +
facet_wrap(~StateAb, nrow = 4)
What about comparing NSW electorates? There seems to be an odd separation in income bands for low ALP 2pp. Could this be a regional vs city difference?
fp_booth_16 <- firstpref_pollingbooth_download() %>%
filter(year == 2016)
trying URL 'https://github.com/ropenscilabs/eechidna/raw/master/extra-data/fp_pp.rda'
Content type 'application/octet-stream' length 3227934 bytes (3.1 MB)
==================================================
downloaded 3.1 MB
polling_2cp <- read_csv('https://results.aec.gov.au/20499/Website/Downloads/HouseTcpByCandidateByPollingPlaceDownload-20499.csv', skip = 1)
polling_2pp <- read_csv('https://results.aec.gov.au/20499/Website/Downloads/HouseTppByPollingPlaceDownload-20499.csv', skip = 1)
fp_booth_2016 <- read_csv('https://results.aec.gov.au/20499/Website/Downloads/HouseStateFirstPrefsByPollingPlaceDownload-20499-NSW.csv', skip = 1) %>%
rbind(read_csv('https://results.aec.gov.au/20499/Website/Downloads/HouseStateFirstPrefsByPollingPlaceDownload-20499-VIC.csv', skip = 1)) %>%
rbind(read_csv('https://results.aec.gov.au/20499/Website/Downloads/HouseStateFirstPrefsByPollingPlaceDownload-20499-QLD.csv', skip = 1)) %>%
rbind(read_csv('https://results.aec.gov.au/20499/Website/Downloads/HouseStateFirstPrefsByPollingPlaceDownload-20499-SA.csv', skip = 1)) %>%
rbind(read_csv('https://results.aec.gov.au/20499/Website/Downloads/HouseStateFirstPrefsByPollingPlaceDownload-20499-WA.csv', skip = 1)) %>%
rbind(read_csv('https://results.aec.gov.au/20499/Website/Downloads/HouseStateFirstPrefsByPollingPlaceDownload-20499-TAS.csv', skip = 1)) %>%
rbind(read_csv('https://results.aec.gov.au/20499/Website/Downloads/HouseStateFirstPrefsByPollingPlaceDownload-20499-NT.csv', skip = 1)) %>%
rbind(read_csv('https://results.aec.gov.au/20499/Website/Downloads/HouseStateFirstPrefsByPollingPlaceDownload-20499-ACT.csv', skip = 1))
coalition_contest_2016 <- fp_booth_2016 %>%
group_by(DivisionNm, PartyNm, HistoricElected) %>%
summarise(OrdinaryVotes = sum(OrdinaryVotes)) %>%
filter(PartyNm %in% c('Liberal', 'Country Liberals (NT)',
'Liberal National Party of Queensland',
'The Nationals')) %>%
group_by(DivisionNm) %>%
top_n(1) %>%
select(DivisionNm, PartyNm)
Selecting by OrdinaryVotes
If we look at a couple of the states where high income booths tend to vote strongly for the coalition as well as lower income booths, we can see that some (but not all) of the lower income booths are contested by The Nationals. This indicares (not surprisingly) that Nationals voters and Liberal voters are different socio-economically, or possibly that city and country coalition voters differ.
polling_place_2pp_clean %>%
mutate(DivisionNm = stringr::str_to_title(DivisionNm)) %>%
inner_join(coalition_contest_2016) %>%
filter(StateAb == 'NSW') %>%
ggplot(aes(y = ALP_Percent/100, x = median_household_income, colour = PartyNm)) +
geom_point(size = 3) +
theme_classic(base_size = 16) + scale_color_manual(values = c('blue', 'dark green')) +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::dollar) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', x = 'Booth Median Income',
y = 'ALP 2pp Percentage', colour = 'Coalition Party',
subtitle = 'by Polling Booth, Unweighted (NSW)')
polling_place_2pp_clean %>%
mutate(DivisionNm = stringr::str_to_title(DivisionNm)) %>%
inner_join(coalition_contest_2016) %>%
filter(StateAb == 'VIC') %>%
ggplot(aes(y = ALP_Percent/100, x = median_household_income, colour = PartyNm)) +
geom_point(size = 3) +
theme_classic(base_size = 16) + scale_color_manual(values = c('blue', 'dark green')) +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::dollar) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', x = 'Booth Median Income',
y = 'ALP 2pp Percentage', colour = 'Coalition Party',
subtitle = 'by Polling Booth, Unweighted (VIC)')
This effect is less clear in states where the Nationals aren’t as prominent, either because the Nationals aren’t as prominent (SA, WA, TAS), or are merged with the Liberals (QLD).
polling_place_2pp_clean %>%
mutate(DivisionNm = stringr::str_to_title(DivisionNm)) %>%
inner_join(coalition_contest_2016) %>%
filter(StateAb == 'WA') %>%
ggplot(aes(y = ALP_Percent/100, x = median_household_income, colour = PartyNm)) +
geom_point(size = 3) +
theme_classic(base_size = 16) + scale_color_manual(values = c('blue', 'dark green')) +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::dollar) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', x = 'Booth Median Income',
y = 'ALP 2pp Percentage', colour = 'Coalition Party',
subtitle = 'by Polling Booth, Unweighted (WA)')
Perhaps we would be better off using the geographical classifications from the AEC.
library(rvest)
webpage <- read_html("http://results.aec.gov.au/20499/Website/HouseDivisionClassifications-20499-NAT.htm")
Division_Classifications <- webpage %>%
html_nodes("#divisionClassifications") %>%
html_table(fill = TRUE) %>%
.[[1]]
Division_Classifications <- Division_Classifications %>%
filter(Division != 'Total Enrolment')
polling_place_2pp_clean<- polling_place_2pp_clean %>%
mutate(Division = stringr::str_to_title(DivisionNm)) %>%
inner_join(Division_Classifications)
Joining, by = "Division"
The graph below shows that the booths that have a low ALP 2pp and a low median income are primarily rural booths. This relationship seems stronger than the Lib/Nat split.
polling_place_2pp_clean %>%
filter(StateAb == 'NSW') %>%
ggplot(aes(y = ALP_Percent/100, x = median_household_income, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::dollar) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage',
y = 'ALP 2pp Percentage',
x = 'Booth Median Income', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted (NSW)')
Looking at all states we see a similar relationship, although less strong than in NSW.
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = median_household_income, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::dollar) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', y = 'ALP 2pp Percentage',
x = 'Median Booth Income', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted')
What about some of the other variables?
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = average_household_size, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', y = 'ALP 2pp Percentage',
x = 'Average Household Size', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = percent_female, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_x_continuous(labels=scales::percent) +
scale_y_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', y = 'ALP 2pp Percentage',
x = 'Percent Female', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
select(ALP_Percent, Swing, median_age, median_household_income, average_household_size,
persons_per_bedroom, median_weekly_rent, median_annual_mortgage,
percent_female, percent_defacto, percent_born_in_australia,
percent_unit, percent_mortgage, percent_rent) %>%
cor %>%
kable()
| ALP_Percent | Swing | median_age | median_household_income | average_household_size | persons_per_bedroom | median_weekly_rent | median_annual_mortgage | percent_female | percent_defacto | percent_born_in_australia | percent_unit | percent_mortgage | percent_rent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALP_Percent | 1.0000000 | -0.2924398 | -0.4148188 | 0.0049899 | 0.1785014 | 0.3698085 | 0.1675435 | 0.1230552 | 0.1245266 | 0.1270139 | -0.3301750 | NA | NA | NA |
| Swing | -0.2924398 | 1.0000000 | 0.0592447 | 0.1044271 | -0.0872137 | 0.0165807 | 0.0817794 | 0.0964162 | 0.0734776 | 0.0042180 | -0.0448560 | NA | NA | NA |
| median_age | -0.4148188 | 0.0592447 | 1.0000000 | -0.4804955 | -0.5175758 | -0.5943939 | -0.4116337 | -0.4221270 | -0.0574141 | -0.0502159 | 0.5197535 | NA | NA | NA |
| median_household_income | 0.0049899 | 0.1044271 | -0.4804955 | 1.0000000 | 0.4183950 | 0.3362658 | 0.8029106 | 0.8644605 | 0.1982974 | -0.0693950 | -0.3959055 | NA | NA | NA |
| average_household_size | 0.1785014 | -0.0872137 | -0.5175758 | 0.4183950 | 1.0000000 | 0.3887206 | 0.3408666 | 0.3284517 | -0.0032833 | -0.5164718 | -0.3020229 | NA | NA | NA |
| persons_per_bedroom | 0.3698085 | 0.0165807 | -0.5943939 | 0.3362658 | 0.3887206 | 1.0000000 | 0.4120754 | 0.4015767 | 0.0035249 | 0.0094906 | -0.5938012 | NA | NA | NA |
| median_weekly_rent | 0.1675435 | 0.0817794 | -0.4116337 | 0.8029106 | 0.3408666 | 0.4120754 | 1.0000000 | 0.9276979 | 0.4307971 | -0.1337758 | -0.6013994 | NA | NA | NA |
| median_annual_mortgage | 0.1230552 | 0.0964162 | -0.4221270 | 0.8644605 | 0.3284517 | 0.4015767 | 0.9276979 | 1.0000000 | 0.3628219 | -0.1183373 | -0.5499270 | NA | NA | NA |
| percent_female | 0.1245266 | 0.0734776 | -0.0574141 | 0.1982974 | -0.0032833 | 0.0035249 | 0.4307971 | 0.3628219 | 1.0000000 | -0.1010107 | -0.1145354 | NA | NA | NA |
| percent_defacto | 0.1270139 | 0.0042180 | -0.0502159 | -0.0693950 | -0.5164718 | 0.0094906 | -0.1337758 | -0.1183373 | -0.1010107 | 1.0000000 | 0.2383005 | NA | NA | NA |
| percent_born_in_australia | -0.3301750 | -0.0448560 | 0.5197535 | -0.3959055 | -0.3020229 | -0.5938012 | -0.6013994 | -0.5499270 | -0.1145354 | 0.2383005 | 1.0000000 | NA | NA | NA |
| percent_unit | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA |
| percent_mortgage | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA |
| percent_rent | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 |
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = persons_per_bedroom, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', y = 'ALP 2pp Percentage',
x = 'Persons per Bedroom', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = percent_unit, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', y = 'ALP 2pp Percentage',
x = 'Percent of Dwellings - Unit', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = percent_mortgage, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', y = 'ALP 2pp Percentage',
x = 'Percent of Dwellings - Under Mortgage', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = percent_rent, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', y = 'ALP 2pp Percentage',
x = 'Percent of Dwellings - Renting', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = percent_indig, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', y = 'ALP 2pp Percentage',
x = 'Percent Indigeneous', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = percent_born_in_australia, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', y = 'ALP 2pp Percentage',
x = 'Percent Born in Australia', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted')
polling_place_2pp_clean %>%
ggplot(aes(y = ALP_Percent/100, x = percent_defacto, colour = Demographic)) +
geom_point(size = 1) +
theme_classic(base_size = 16) + scale_color_brewer(palette = "Dark2") +
theme(legend.position = 'bottom') +
scale_y_continuous(labels=scales::percent) +
scale_x_continuous(labels=scales::percent) +
labs(title = '2016 Election: ALP 2 Party Preferred Percentage', y = 'ALP 2pp Percentage',
x = 'Percent in a Defacto Relationship', colour = 'Region',
subtitle = 'by Polling Booth, Unweighted')
Add extra variables Look at lagged results Build models
Can we build a simple linear model to predict 2pp
library(MASS)
library(car)
alp_2pp_lm_demog <-
lm(ALP_Percent ~ median_household_income*Demographic + percent_indig +
percent_female + percent_defacto + percent_born_in_australia +
percent_rent + median_weekly_rent + median_age,
data = polling_place_2pp_clean)
summary(alp_2pp_lm_demog)
Call:
lm(formula = ALP_Percent ~ median_household_income * Demographic +
percent_indig + percent_female + percent_defacto + percent_born_in_australia +
percent_rent + median_weekly_rent + median_age, data = polling_place_2pp_clean)
Residuals:
Min 1Q Median 3Q Max
-63.536 -8.043 -0.237 7.809 73.017
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.096e+02 5.585e+00 19.631 < 2e-16
median_household_income -5.176e-04 1.767e-05 -29.284 < 2e-16
DemographicOuter Metropolitan -1.583e+00 1.866e+00 -0.849 0.39614
DemographicProvincial -1.072e+01 2.381e+00 -4.505 6.75e-06
DemographicRural -3.616e+01 1.977e+00 -18.288 < 2e-16
percent_indig 1.650e+01 2.761e+00 5.976 2.39e-09
percent_female 1.676e+01 1.135e+01 1.477 0.13971
percent_defacto 1.514e+02 7.955e+00 19.027 < 2e-16
percent_born_in_australia -9.365e+00 2.329e+00 -4.022 5.83e-05
percent_rent -1.277e+01 2.510e+00 -5.088 3.70e-07
median_weekly_rent 3.201e-02 3.118e-03 10.265 < 2e-16
median_age -8.904e-01 4.622e-02 -19.264 < 2e-16
median_household_income:DemographicOuter Metropolitan -3.321e-06 2.131e-05 -0.156 0.87619
median_household_income:DemographicProvincial 8.332e-05 3.051e-05 2.731 0.00634
median_household_income:DemographicRural 3.091e-04 2.498e-05 12.372 < 2e-16
(Intercept) ***
median_household_income ***
DemographicOuter Metropolitan
DemographicProvincial ***
DemographicRural ***
percent_indig ***
percent_female
percent_defacto ***
percent_born_in_australia ***
percent_rent ***
median_weekly_rent ***
median_age ***
median_household_income:DemographicOuter Metropolitan
median_household_income:DemographicProvincial **
median_household_income:DemographicRural ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.15 on 7775 degrees of freedom
(52 observations deleted due to missingness)
Multiple R-squared: 0.3995, Adjusted R-squared: 0.3984
F-statistic: 369.5 on 14 and 7775 DF, p-value: < 2.2e-16
anova(alp_2pp_lm_demog)
Analysis of Variance Table
Response: ALP_Percent
Df Sum Sq Mean Sq F value Pr(>F)
median_household_income 1 172 172 1.1620 0.281093
Demographic 3 453116 151039 1022.4845 < 2.2e-16 ***
percent_indig 1 31754 31754 214.9672 < 2.2e-16 ***
percent_female 1 48 48 0.3245 0.568953
percent_defacto 1 74787 74787 506.2809 < 2.2e-16 ***
percent_born_in_australia 1 59450 59450 402.4580 < 2.2e-16 ***
percent_rent 1 5547 5547 37.5484 9.352e-10 ***
median_weekly_rent 1 1397 1397 9.4542 0.002114 **
median_age 1 110129 110129 745.5389 < 2.2e-16 ***
median_household_income:Demographic 3 27664 9221 62.4249 < 2.2e-16 ***
Residuals 7775 1148503 148
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(alp_2pp_lm_demog)
vif(alp_2pp_lm_demog)
GVIF Df GVIF^(1/(2*Df))
median_household_income 7.316624 1 2.704926
Demographic 18261.455172 3 5.131637
percent_indig 1.706153 1 1.306198
percent_female 1.650000 1 1.284523
percent_defacto 1.614552 1 1.270650
percent_born_in_australia 4.563213 1 2.136168
percent_rent 3.299426 1 1.816432
median_weekly_rent 6.269462 1 2.503889
median_age 3.407630 1 1.845977
median_household_income:Demographic 13002.761451 3 4.849228
# crPlots(alp_2pp_lm_demog)
# ceresPlots(alp_2pp_lm_demog)
library(gvlma)
gv_alp_2pp_lm_demog <- gvlma(alp_2pp_lm_demog)
summary(gv_alp_2pp_lm_demog)
Call:
lm(formula = ALP_Percent ~ median_household_income * Demographic +
percent_indig + percent_female + percent_defacto + percent_born_in_australia +
percent_rent + median_weekly_rent + median_age, data = polling_place_2pp_clean)
Residuals:
Min 1Q Median 3Q Max
-63.536 -8.043 -0.237 7.809 73.017
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.096e+02 5.585e+00 19.631 < 2e-16
median_household_income -5.176e-04 1.767e-05 -29.284 < 2e-16
DemographicOuter Metropolitan -1.583e+00 1.866e+00 -0.849 0.39614
DemographicProvincial -1.072e+01 2.381e+00 -4.505 6.75e-06
DemographicRural -3.616e+01 1.977e+00 -18.288 < 2e-16
percent_indig 1.650e+01 2.761e+00 5.976 2.39e-09
percent_female 1.676e+01 1.135e+01 1.477 0.13971
percent_defacto 1.514e+02 7.955e+00 19.027 < 2e-16
percent_born_in_australia -9.365e+00 2.329e+00 -4.022 5.83e-05
percent_rent -1.277e+01 2.510e+00 -5.088 3.70e-07
median_weekly_rent 3.201e-02 3.118e-03 10.265 < 2e-16
median_age -8.904e-01 4.622e-02 -19.264 < 2e-16
median_household_income:DemographicOuter Metropolitan -3.321e-06 2.131e-05 -0.156 0.87619
median_household_income:DemographicProvincial 8.332e-05 3.051e-05 2.731 0.00634
median_household_income:DemographicRural 3.091e-04 2.498e-05 12.372 < 2e-16
(Intercept) ***
median_household_income ***
DemographicOuter Metropolitan
DemographicProvincial ***
DemographicRural ***
percent_indig ***
percent_female
percent_defacto ***
percent_born_in_australia ***
percent_rent ***
median_weekly_rent ***
median_age ***
median_household_income:DemographicOuter Metropolitan
median_household_income:DemographicProvincial **
median_household_income:DemographicRural ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.15 on 7775 degrees of freedom
(52 observations deleted due to missingness)
Multiple R-squared: 0.3995, Adjusted R-squared: 0.3984
F-statistic: 369.5 on 14 and 7775 DF, p-value: < 2.2e-16
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = alp_2pp_lm_demog)
library(MASS)
library(car)
alp_2pp_lm_demog <-
lm(ALP_Percent ~ median_household_income + Demographic + percent_indig +
percent_defacto + percent_born_in_australia +
percent_rent + median_weekly_rent + median_age,
data = polling_place_2pp_clean)
summary(alp_2pp_lm_demog)
Call:
lm(formula = ALP_Percent ~ median_household_income + Demographic +
percent_indig + percent_defacto + percent_born_in_australia +
percent_rent + median_weekly_rent + median_age, data = polling_place_2pp_clean)
Residuals:
Min 1Q Median 3Q Max
-63.420 -8.135 -0.258 8.075 82.895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.235e+02 2.624e+00 47.065 < 2e-16 ***
median_household_income -4.488e-04 1.369e-05 -32.780 < 2e-16 ***
DemographicOuter Metropolitan -1.567e+00 4.622e-01 -3.391 0.000699 ***
DemographicProvincial -2.691e+00 6.274e-01 -4.290 1.81e-05 ***
DemographicRural -1.417e+01 6.081e-01 -23.303 < 2e-16 ***
percent_indig 1.292e+01 2.769e+00 4.665 3.14e-06 ***
percent_defacto 1.524e+02 7.826e+00 19.480 < 2e-16 ***
percent_born_in_australia -1.122e+01 2.141e+00 -5.240 1.65e-07 ***
percent_rent -1.580e+01 2.440e+00 -6.475 1.01e-10 ***
median_weekly_rent 2.988e-02 2.768e-03 10.796 < 2e-16 ***
median_age -1.123e+00 4.175e-02 -26.907 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.3 on 7779 degrees of freedom
(52 observations deleted due to missingness)
Multiple R-squared: 0.3846, Adjusted R-squared: 0.3838
F-statistic: 486.2 on 10 and 7779 DF, p-value: < 2.2e-16
anova(alp_2pp_lm_demog)
Analysis of Variance Table
Response: ALP_Percent
Df Sum Sq Mean Sq F value Pr(>F)
median_household_income 1 172 172 1.1345 0.2869
Demographic 3 453116 151039 998.2850 < 2.2e-16 ***
percent_indig 1 31754 31754 209.8795 < 2.2e-16 ***
percent_defacto 1 74823 74823 494.5410 < 2.2e-16 ***
percent_born_in_australia 1 56968 56968 376.5284 < 2.2e-16 ***
percent_rent 1 6712 6712 44.3618 2.915e-11 ***
median_weekly_rent 1 2532 2532 16.7346 4.342e-05 ***
median_age 1 109539 109539 723.9951 < 2.2e-16 ***
Residuals 7779 1176949 151
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(alp_2pp_lm_demog)
vif(alp_2pp_lm_demog)
GVIF Df GVIF^(1/(2*Df))
median_household_income 4.285242 1 2.070083
Demographic 4.657881 3 1.292304
percent_indig 1.675652 1 1.294470
percent_defacto 1.525730 1 1.235204
percent_born_in_australia 3.765674 1 1.940534
percent_rent 3.043732 1 1.744629
median_weekly_rent 4.823036 1 2.196141
median_age 2.713934 1 1.647402
crPlots(alp_2pp_lm_demog)
ceresPlots(alp_2pp_lm_demog)
Factors skipped in drawing CERES plots.
library(gvlma)
gv_alp_2pp_lm_demog <- gvlma(alp_2pp_lm_demog)
summary(gv_alp_2pp_lm_demog)
Call:
lm(formula = ALP_Percent ~ median_household_income + Demographic +
percent_indig + percent_defacto + percent_born_in_australia +
percent_rent + median_weekly_rent + median_age, data = polling_place_2pp_clean)
Residuals:
Min 1Q Median 3Q Max
-63.420 -8.135 -0.258 8.075 82.895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.235e+02 2.624e+00 47.065 < 2e-16 ***
median_household_income -4.488e-04 1.369e-05 -32.780 < 2e-16 ***
DemographicOuter Metropolitan -1.567e+00 4.622e-01 -3.391 0.000699 ***
DemographicProvincial -2.691e+00 6.274e-01 -4.290 1.81e-05 ***
DemographicRural -1.417e+01 6.081e-01 -23.303 < 2e-16 ***
percent_indig 1.292e+01 2.769e+00 4.665 3.14e-06 ***
percent_defacto 1.524e+02 7.826e+00 19.480 < 2e-16 ***
percent_born_in_australia -1.122e+01 2.141e+00 -5.240 1.65e-07 ***
percent_rent -1.580e+01 2.440e+00 -6.475 1.01e-10 ***
median_weekly_rent 2.988e-02 2.768e-03 10.796 < 2e-16 ***
median_age -1.123e+00 4.175e-02 -26.907 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.3 on 7779 degrees of freedom
(52 observations deleted due to missingness)
Multiple R-squared: 0.3846, Adjusted R-squared: 0.3838
F-statistic: 486.2 on 10 and 7779 DF, p-value: < 2.2e-16
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = alp_2pp_lm_demog)
We know that income is important in the non-rural areas, so it might be worth adding an interaction between rural and non-rural and income.
polling_place_2pp_clean <- polling_place_2pp_clean %>%
mutate(NonRural_Demographic = if_else(Demographic == 'Rural', 1, 0),
ALP_Percent_2013 = ALP_Percent + Swing)
alp_2pp_lm_demog <-
lm(ALP_Percent ~ median_household_income + Demographic + percent_indig +
percent_defacto + percent_born_in_australia +
percent_rent + median_weekly_rent + median_age +
NonRural_Demographic:median_household_income,
data = polling_place_2pp_clean)
summary(alp_2pp_lm_demog)
Call:
lm(formula = ALP_Percent ~ median_household_income + Demographic +
percent_indig + percent_defacto + percent_born_in_australia +
percent_rent + median_weekly_rent + median_age + NonRural_Demographic:median_household_income,
data = polling_place_2pp_clean)
Residuals:
Min 1Q Median 3Q Max
-63.412 -8.045 -0.263 7.812 73.408
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.179e+02 2.627e+00 44.891 < 2e-16 ***
median_household_income -5.127e-04 1.434e-05 -35.751 < 2e-16 ***
DemographicOuter Metropolitan -1.916e+00 4.576e-01 -4.188 2.85e-05 ***
DemographicProvincial -4.784e+00 6.394e-01 -7.482 8.11e-14 ***
DemographicRural -3.503e+01 1.660e+00 -21.098 < 2e-16 ***
percent_indig 1.574e+01 2.746e+00 5.731 1.03e-08 ***
percent_defacto 1.491e+02 7.741e+00 19.259 < 2e-16 ***
percent_born_in_australia -8.312e+00 2.128e+00 -3.907 9.43e-05 ***
percent_rent -1.302e+01 2.421e+00 -5.380 7.65e-08 ***
median_weekly_rent 3.336e-02 2.749e-03 12.138 < 2e-16 ***
median_age -9.197e-01 4.395e-02 -20.929 < 2e-16 ***
median_household_income:NonRural_Demographic 2.933e-04 2.176e-05 13.477 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.16 on 7778 degrees of freedom
(52 observations deleted due to missingness)
Multiple R-squared: 0.3987, Adjusted R-squared: 0.3978
F-statistic: 468.8 on 11 and 7778 DF, p-value: < 2.2e-16
anova(alp_2pp_lm_demog)
Analysis of Variance Table
Response: ALP_Percent
Df Sum Sq Mean Sq F value Pr(>F)
median_household_income 1 172 172 1.1608 0.2813
Demographic 3 453116 151039 1021.4652 < 2.2e-16 ***
percent_indig 1 31754 31754 214.7529 < 2.2e-16 ***
percent_defacto 1 74823 74823 506.0242 < 2.2e-16 ***
percent_born_in_australia 1 56968 56968 385.2714 < 2.2e-16 ***
percent_rent 1 6712 6712 45.3919 1.728e-11 ***
median_weekly_rent 1 2532 2532 17.1232 3.540e-05 ***
median_age 1 109539 109539 740.8063 < 2.2e-16 ***
median_household_income:NonRural_Demographic 1 26856 26856 181.6285 < 2.2e-16 ***
Residuals 7778 1150093 148
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(alp_2pp_lm_demog)
vif(alp_2pp_lm_demog)
GVIF Df GVIF^(1/(2*Df))
median_household_income 4.812683 1 2.193783
Demographic 62.371236 3 1.991425
percent_indig 1.685419 1 1.298237
percent_defacto 1.527321 1 1.235848
percent_born_in_australia 3.804766 1 1.950581
percent_rent 3.065885 1 1.750967
median_weekly_rent 4.865961 1 2.205892
median_age 3.077376 1 1.754245
median_household_income:NonRural_Demographic 23.061591 1 4.802249
# crPlots(alp_2pp_lm_demog)
# ceresPlots(alp_2pp_lm_demog)
gv_alp_2pp_lm_demog <- gvlma(alp_2pp_lm_demog)
summary(gv_alp_2pp_lm_demog)
Call:
lm(formula = ALP_Percent ~ median_household_income + Demographic +
percent_indig + percent_defacto + percent_born_in_australia +
percent_rent + median_weekly_rent + median_age + NonRural_Demographic:median_household_income,
data = polling_place_2pp_clean)
Residuals:
Min 1Q Median 3Q Max
-63.412 -8.045 -0.263 7.812 73.408
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.179e+02 2.627e+00 44.891 < 2e-16 ***
median_household_income -5.127e-04 1.434e-05 -35.751 < 2e-16 ***
DemographicOuter Metropolitan -1.916e+00 4.576e-01 -4.188 2.85e-05 ***
DemographicProvincial -4.784e+00 6.394e-01 -7.482 8.11e-14 ***
DemographicRural -3.503e+01 1.660e+00 -21.098 < 2e-16 ***
percent_indig 1.574e+01 2.746e+00 5.731 1.03e-08 ***
percent_defacto 1.491e+02 7.741e+00 19.259 < 2e-16 ***
percent_born_in_australia -8.312e+00 2.128e+00 -3.907 9.43e-05 ***
percent_rent -1.302e+01 2.421e+00 -5.380 7.65e-08 ***
median_weekly_rent 3.336e-02 2.749e-03 12.138 < 2e-16 ***
median_age -9.197e-01 4.395e-02 -20.929 < 2e-16 ***
median_household_income:NonRural_Demographic 2.933e-04 2.176e-05 13.477 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.16 on 7778 degrees of freedom
(52 observations deleted due to missingness)
Multiple R-squared: 0.3987, Adjusted R-squared: 0.3978
F-statistic: 468.8 on 11 and 7778 DF, p-value: < 2.2e-16
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = alp_2pp_lm_demog)
Using a lasso regression,
library(caret)
library(glmnet)
polling_place_2pp_clean_NA <- na.omit(polling_place_2pp_clean) %>%
dplyr::select(ALP_Percent, StateAb, median_age, median_household_income,
average_household_size, persons_per_bedroom, median_weekly_rent,
median_annual_mortgage, percent_female, percent_defacto,
percent_married, percent_indig, percent_born_in_australia,
percent_unit, percent_mortgage, percent_rent, Demographic)
# Inspect the data
sample_n(polling_place_2pp_clean_NA, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- polling_place_2pp_clean_NA$ALP_Percent %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- polling_place_2pp_clean_NA[training.samples, ]
test.data <- polling_place_2pp_clean_NA[-training.samples, ]
# Dumy code categorical predictor variables
x <- model.matrix(ALP_Percent~., train.data)[,-1]
# Convert the outcome (class) to a numerical variable
y <- train.data$ALP_Percent
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "gaussian")
# Fit the final model on the training data
alp_2pp_lasso_demog <- glmnet(x, y, alpha = 1, family = "gaussian",
lambda = cv.lasso$lambda.min)
# Display regression coefficients
coef(alp_2pp_lasso_demog)
25 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 1.506842e+01
StateAbNSW -1.054087e+01
StateAbNT -7.804646e+00
StateAbQLD -1.630224e+01
StateAbSA -1.248631e+01
StateAbTAS -7.533343e+00
StateAbVIC -1.316506e+01
StateAbWA -2.007797e+01
median_age 3.361810e-01
median_household_income -2.954546e-04
average_household_size 6.490529e+00
persons_per_bedroom 3.567807e+01
median_weekly_rent 1.391279e-02
median_annual_mortgage 1.118316e-04
percent_female 3.578588e+01
percent_defacto 1.719937e+02
percent_married -8.101607e+01
percent_indig -4.221851e+01
percent_born_in_australia -2.354045e+01
percent_unit -4.874705e+01
percent_mortgage 4.682609e+01
percent_rent 4.530053e+01
DemographicOuter Metropolitan -3.125436e+00
DemographicProvincial -3.269083e+00
DemographicRural -1.363746e+01
summary(alp_2pp_lasso_demog)
Length Class Mode
a0 1 -none- numeric
beta 24 dgCMatrix S4
df 1 -none- numeric
dim 2 -none- numeric
lambda 1 -none- numeric
dev.ratio 1 -none- numeric
nulldev 1 -none- numeric
npasses 1 -none- numeric
jerr 1 -none- numeric
offset 1 -none- logical
call 6 -none- call
nobs 1 -none- numeric
# Make predictions on the test data
x.test <- model.matrix(ALP_Percent ~., test.data)[,-1]
predictions <- alp_2pp_lasso_demog %>% predict(newx = x.test)
# Model accuracy
observed <- test.data$ALP_Percent
plot(predictions, observed)
cor(predictions, observed)
[,1]
s0 0.7230284
cor(predictions, observed)^2
[,1]
s0 0.5227701
What if we include the 2pp from the last election. It appears that Swing is defined as the swing to the Coalition between 2013 and 2016. So if we add the swing to the 2016 2pp, then we obtain the 2013 2pp.
When using a Linear Model, we can get an R^2 of about 84%
polling_place_2pp_clean <- polling_place_2pp_clean %>%
mutate(logit_ALP_Percent = log((ALP_Percent/100)/(1-ALP_Percent/100)),
logit_ALP_Percent_2013 = log((ALP_Percent_2013/100)/(1-ALP_Percent_2013/100))) %>%
filter(!is.nan(logit_ALP_Percent)) %>%
filter(!is.na(logit_ALP_Percent)) %>%
filter(!is.infinite(logit_ALP_Percent)) %>%
filter(!is.nan(logit_ALP_Percent_2013)) %>%
filter(!is.na(logit_ALP_Percent_2013)) %>%
filter(!is.infinite(logit_ALP_Percent_2013))
NaNs produced
alp_2pp_lm_demog_lag <-
lm(logit_ALP_Percent ~ percent_indig + percent_defacto +
percent_rent + median_weekly_rent + median_age +
NonRural_Demographic*median_household_income + logit_ALP_Percent_2013,
data = polling_place_2pp_clean)
summary(alp_2pp_lm_demog_lag)
Call:
lm(formula = logit_ALP_Percent ~ percent_indig + percent_defacto +
percent_rent + median_weekly_rent + median_age + NonRural_Demographic *
median_household_income + logit_ALP_Percent_2013, data = polling_place_2pp_clean)
Residuals:
Min 1Q Median 3Q Max
-2.1110 -0.1297 -0.0042 0.1305 3.1633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.096e+00 5.178e-02 21.164 < 2e-16 ***
percent_indig 4.115e-01 6.038e-02 6.816 1.00e-11 ***
percent_defacto 5.801e-01 1.692e-01 3.428 0.000612 ***
percent_rent -3.884e-01 4.807e-02 -8.080 7.48e-16 ***
median_weekly_rent 2.500e-04 5.992e-05 4.172 3.05e-05 ***
median_age -1.478e-02 9.831e-04 -15.030 < 2e-16 ***
NonRural_Demographic -2.116e-01 3.464e-02 -6.108 1.06e-09 ***
median_household_income -5.178e-06 3.219e-07 -16.086 < 2e-16 ***
logit_ALP_Percent_2013 8.974e-01 6.057e-03 148.158 < 2e-16 ***
NonRural_Demographic:median_household_income 2.094e-06 4.902e-07 4.272 1.96e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2802 on 7733 degrees of freedom
(52 observations deleted due to missingness)
Multiple R-squared: 0.8392, Adjusted R-squared: 0.839
F-statistic: 4485 on 9 and 7733 DF, p-value: < 2.2e-16
anova(alp_2pp_lm_demog_lag)
Analysis of Variance Table
Response: logit_ALP_Percent
Df Sum Sq Mean Sq F value Pr(>F)
percent_indig 1 0.02 0.02 0.2255 0.6349
percent_defacto 1 78.03 78.03 993.6088 < 2.2e-16 ***
percent_rent 1 415.03 415.03 5285.0079 < 2.2e-16 ***
median_weekly_rent 1 28.44 28.44 362.1399 < 2.2e-16 ***
median_age 1 225.78 225.78 2875.1283 < 2.2e-16 ***
NonRural_Demographic 1 290.25 290.25 3696.0787 < 2.2e-16 ***
median_household_income 1 366.95 366.95 4672.7058 < 2.2e-16 ***
logit_ALP_Percent_2013 1 1763.84 1763.84 22460.9325 < 2.2e-16 ***
NonRural_Demographic:median_household_income 1 1.43 1.43 18.2460 1.965e-05 ***
Residuals 7733 607.27 0.08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(alp_2pp_lm_demog_lag)
vif(alp_2pp_lm_demog_lag)
percent_indig percent_defacto
1.502748 1.366026
percent_rent median_weekly_rent
2.262058 4.328613
median_age NonRural_Demographic
2.884030 27.998884
median_household_income logit_ALP_Percent_2013
4.532365 1.598839
NonRural_Demographic:median_household_income
21.897932
# crPlots(alp_2pp_lm_demog_lag)
# ceresPlots(alp_2pp_lm_demog_lag)
gv_alp_2pp_lm_demog_lag <- gvlma(alp_2pp_lm_demog_lag)
summary(alp_2pp_lm_demog_lag)
Call:
lm(formula = logit_ALP_Percent ~ percent_indig + percent_defacto +
percent_rent + median_weekly_rent + median_age + NonRural_Demographic *
median_household_income + logit_ALP_Percent_2013, data = polling_place_2pp_clean)
Residuals:
Min 1Q Median 3Q Max
-2.1110 -0.1297 -0.0042 0.1305 3.1633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.096e+00 5.178e-02 21.164 < 2e-16 ***
percent_indig 4.115e-01 6.038e-02 6.816 1.00e-11 ***
percent_defacto 5.801e-01 1.692e-01 3.428 0.000612 ***
percent_rent -3.884e-01 4.807e-02 -8.080 7.48e-16 ***
median_weekly_rent 2.500e-04 5.992e-05 4.172 3.05e-05 ***
median_age -1.478e-02 9.831e-04 -15.030 < 2e-16 ***
NonRural_Demographic -2.116e-01 3.464e-02 -6.108 1.06e-09 ***
median_household_income -5.178e-06 3.219e-07 -16.086 < 2e-16 ***
logit_ALP_Percent_2013 8.974e-01 6.057e-03 148.158 < 2e-16 ***
NonRural_Demographic:median_household_income 2.094e-06 4.902e-07 4.272 1.96e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2802 on 7733 degrees of freedom
(52 observations deleted due to missingness)
Multiple R-squared: 0.8392, Adjusted R-squared: 0.839
F-statistic: 4485 on 9 and 7733 DF, p-value: < 2.2e-16
It would be interesting to see whether 2013 alone is a good predictor. Fitting this model gives an R^2 of 82.7%. This means that the other variables add a bit, but not a huge amount.
alp_2pp_lm_demog_lag <-
lm(logit_ALP_Percent ~ logit_ALP_Percent_2013,
data = polling_place_2pp_clean)
summary(alp_2pp_lm_demog_lag)
Call:
lm(formula = logit_ALP_Percent ~ logit_ALP_Percent_2013, data = polling_place_2pp_clean)
Residuals:
Min 1Q Median 3Q Max
-2.2357 -0.1441 -0.0115 0.1335 3.2092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.127440 0.003472 36.7 <2e-16 ***
logit_ALP_Percent_2013 0.954281 0.004944 193.0 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2902 on 7793 degrees of freedom
Multiple R-squared: 0.827, Adjusted R-squared: 0.827
F-statistic: 3.725e+04 on 1 and 7793 DF, p-value: < 2.2e-16
anova(alp_2pp_lm_demog_lag)
Analysis of Variance Table
Response: logit_ALP_Percent
Df Sum Sq Mean Sq F value Pr(>F)
logit_ALP_Percent_2013 1 3136.83 3136.83 37249 < 2.2e-16 ***
Residuals 7793 656.27 0.08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(alp_2pp_lm_demog_lag)
There are three outlying values - let’s explore these
polling_place_2pp_clean[c(2079, 7267, 2758),]
These booths have fewer than 40 electors, and fewer than 5 for one of the two parties. They are also ‘non-standard’ booths.
It might be interesting to see which booths deviate from the division mean.